Abstract



In this paper we propose an approximated structured prediction framework for large 
scale graphical models and derive message-passing algorithms for learning their parameters 
efficiently. We first relate CRFs and structured SVMs and show that in CRFs a variant of 
the log-partition function, known as soft-max, smoothly approximates the hinge loss func- 
tion of structured SVMs. We then propose an intuitive approximation for the structured 
prediction problem, using duality, based on local entropy approximations and derive an 
efficient message-passing algorithm that is guaranteed to converge to the optimum for
concave entropy approximations. Unlike existing approaches, this allows us to efficiently
learn graphical models with cycles and a very large number of parameters. We demonstrate
the effectiveness of our approach on an image denoising task. This task was previously
solved by sharing parameters across cliques. In contrast, our algorithm is able to efficiently
learn a large number of parameters, resulting in orders of magnitude better prediction.



1. Introduction 



Unlike standard supervised learning problems which involve simple scalar outputs, struc- 
tured prediction deals with structured outputs such as sequences, grids, or more general 
graphs. Ideally, one would want to make joint predictions on the structured labels instead 
of simply predicting each element independently, as this additionally accounts for the sta- 
tistical correlations between label elements, as well as between training examples and their 
labels. These properties make structured prediction appealing for a wide range of applica- 
tions such as image segmentation, image denoising, sequence labeling and natural language 
parsing. 

Several structured prediction models have been recently proposed, including log-likelihood
models such as conditional random fields (CRFs, Lafferty et al. (2001)), and structured
support vector machines (structured SVMs) such as maximum-margin Markov networks
(M3Ns, Taskar et al. (2004)) and structured output learning (Tsochantaridis et al. (2006)).
For CRFs, learning is done by minimizing a convex function composed of a negative
log-likelihood loss and a regularization term. Learning structured SVMs is done by minimizing
the convex regularized structured hinge loss.
Despite the convexity of the objective functions, finding the optimal parameters of these
models can be computationally expensive since it involves exponentially many labels. When
the label structure corresponds to a tree, learning can be done efficiently by using belief
propagation as a subroutine: the sum-product algorithm is typically used in CRFs and
the max-product algorithm in structured SVMs. When the label structure corresponds to
a general graph, one can compute neither the objective nor the gradient exactly, and usually
resorts to approximate inference algorithms, e.g. Finley and Joachims (2006) for structured
SVMs and Taskar et al. (2002); Levin and Weiss (2006); Yanover et al. (2008); Taskar et al.
(2007) for CRFs. However, the approximate inference algorithms are computationally too
expensive to be used as a subroutine of the learning algorithm, therefore they cannot be
applied efficiently to large scale structured prediction problems. Also, it is not clear how to
define a stopping criterion, since the objective and gradient are only approximated; as a
consequence the objective does not decrease monotonically, and the procedure may result in
poor approximations.

In this paper we propose an approximated structured prediction framework for large 
scale graphical models and derive message-passing algorithms for learning their parameters 
efficiently. We relate CRFs and structured SVMs, and show that in CRFs a variant of 
the log-partition function, known as soft-max, smoothly approximates the hinge loss func- 
tion of structured SVMs. We then propose an intuitive approximation for the structured 
prediction problem, using duality, based on a local entropy approximation and derive an 
efficient message-passing algorithm that is guaranteed to converge to the optimum for
concave entropy approximations. Unlike existing approaches, this allows us to efficiently
learn graphical models with cycles and a very large number of parameters. We demonstrate
the effectiveness of our approach on an image denoising task. This task was previously
solved by sharing parameters across cliques. In contrast, our algorithm is able to efficiently
learn a large number of parameters, resulting in orders of magnitude better prediction.

The rest of the paper is organized as follows. In Section 2 we review regularized loss
minimization focusing on its most common models, CRFs and structured SVMs. We relate
CRFs and structured SVMs in Section 2.1, and present the corresponding graphical models
in Section 2.2. We present our approximate prediction framework in Section 3, derive a
message-passing algorithm to solve the approximated problem efficiently in Section 4, and
show our experimental evaluation in Section 5.

2. Regularized Loss Minimization 

Consider a supervised learning setting with objects $x \in \mathcal{X}$ and labels
$y \in \mathcal{Y}$. In structured prediction the labels may be sequences, trees, grids, or
other high-dimensional objects with internal structure. Consider a function
$\Phi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^d$ that maps $(x, y)$ pairs to feature
vectors. Our goal is to construct a linear prediction rule

$$y_\theta(x) = \arg\max_{\hat y \in \mathcal{Y}} \theta^\top \Phi(x, \hat y)$$

with parameters $\theta \in \mathbb{R}^d$ such that $y_\theta(x)$ is a good approximation to the
true label of $x$. The parameters $\theta$ are typically learned by minimizing the regularized loss

$$\sum_{(x,y) \in S} \ell(\theta, x, y) + \frac{C}{p} \|\theta\|_p^p, \qquad (1)$$

defined over a training set $S$. The function $\ell$ measures the loss incurred in using
$y_\theta(x)$ to predict the label of $x$, given that the true label is $y$.

In this paper we focus on structured SVMs and CRFs, which are the most common
structured prediction models. The first definition of structured SVMs used the structured
hinge loss, originally introduced by Taskar et al. (2004),

$$\ell_{hinge}(\theta, x, y) = \max_{\hat y \in \mathcal{Y}} \left\{ e(y, \hat y) + \theta^\top \Phi(x, \hat y) - \theta^\top \Phi(x, y) \right\},$$

where $e(y, \hat y)$ is some non-negative measure of error when predicting $\hat y$ instead of
$y$ as the label of $x$. We assume that $e(y, y) = 0$, so that no loss is incurred for correct
prediction. This loss function corresponds to a maximum-margin approach that explicitly
penalizes training examples $(x, y)$ for which
$\theta^\top \Phi(x, y) < e(y, y_\theta(x)) + \theta^\top \Phi(x, y_\theta(x))$.

The second loss function that we consider is based on log-linear models, and is commonly
used in CRFs (Lafferty et al. (2001)). Let the conditional distribution be

$$p(\hat y \mid \theta; x, y) = \frac{1}{Z(x, y)} \exp\left( e_y(\hat y) + \theta^\top \Phi(x, \hat y) \right), \qquad Z(x, y) = \sum_{\hat y \in \mathcal{Y}} \exp\left( e_y(\hat y) + \theta^\top \Phi(x, \hat y) \right),$$

where $e_y(\hat y) = e(y, \hat y)$ corresponds to a prior distribution, and $Z(x, y)$ is the
partition function. The loss function is then the negative log-likelihood under the parameters
$\theta$:

$$\ell_{\log}(\theta, x, y) = \ln \frac{1}{p(y \mid \theta; x, y)}.$$

In both structured SVMs and CRFs a convex loss function and a convex regularization are
minimized.
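The two losses can be compared numerically on a toy problem. The sketch below is illustrative only: the feature map `phi`, the Hamming error `hamming`, and the two-element binary label space are assumptions made for the example, and all labels are enumerated by brute force, which is feasible only for tiny label spaces.

```python
import itertools
import math

# All labelings of a hypothetical 2-element binary label y = (y1, y2).
LABELS = list(itertools.product([0, 1], repeat=2))


def phi(x, y):
    """Hypothetical feature map Phi(x, y): two local terms and one agreement term."""
    return [x[0] * y[0], x[1] * y[1], 1.0 if y[0] == y[1] else 0.0]


def score(theta, feats):
    """Inner product theta^T Phi."""
    return sum(t * f for t, f in zip(theta, feats))


def hamming(y, y_hat):
    """Error measure e(y, y_hat); it is zero when y_hat == y."""
    return float(sum(a != b for a, b in zip(y, y_hat)))


def predict(theta, x):
    """Linear prediction rule: argmax over labels of theta^T Phi(x, y_hat)."""
    return max(LABELS, key=lambda y_hat: score(theta, phi(x, y_hat)))


def augmented_scores(theta, x, y):
    """e(y, y_hat) + theta^T Phi(x, y_hat) for every candidate label."""
    return [hamming(y, yh) + score(theta, phi(x, yh)) for yh in LABELS]


def hinge_loss(theta, x, y):
    """Structured hinge loss: max of augmented scores minus the true-label score."""
    return max(augmented_scores(theta, x, y)) - score(theta, phi(x, y))


def log_loss(theta, x, y):
    """Negative log-likelihood: ln Z(x, y) minus the true-label score (e(y, y) = 0)."""
    aug = augmented_scores(theta, x, y)
    m = max(aug)
    log_z = m + math.log(sum(math.exp(a - m) for a in aug))
    return log_z - score(theta, phi(x, y))
```

Because the log-partition function upper-bounds the maximum, the log loss computed this way is never smaller than the hinge loss, and exceeds it by at most the logarithm of the number of labels.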

2.1 One parameter extension of CRFs and Structured SVMs 

In CRFs one aims to minimize the regularized negative log-likelihood of the conditional
distribution $p(y \mid \theta; x, y)$, which decomposes into the log-partition $\ln Z(x, y)$ and
the linear term $\theta^\top \Phi(x, y)$. Hence the problem of minimizing the regularized loss in
(1) with the loss function $\ell_{\log}$ can be written as

$$\text{(CRF)} \qquad \min_\theta \sum_{(x,y) \in S} \ln Z(x, y) - d^\top \theta + \frac{C}{p} \|\theta\|_p^p,$$

where $(x, y) \in S$ ranges over training pairs and $d = \sum_{(x,y) \in S} \Phi(x, y)$ is the
vector of empirical means. In gradient based methods, a coordinate $\theta_r$ is updated in the
direction of the negative gradient, for some step size $\eta$. The gradient of the log-partition
function corresponds to the probability distribution $p(\hat y \mid \theta; x, y)$, and the
gradient with respect to $\theta_r$ takes the form

$$\sum_{(x,y) \in S} \sum_{\hat y \in \mathcal{Y}} p(\hat y \mid \theta; x, y) \, \phi_r(x, \hat y) - d_r + C |\theta_r|^{p-1} \mathrm{sign}(\theta_r),$$

where $\phi_r$ denotes the $r$-th coordinate of $\Phi$.

Structured SVMs aim at minimizing the regularized hinge loss $\ell_{hinge}(\theta, x, y)$, which
measures the loss of the label $y_\theta(x)$ that most violates the training pair $(x, y) \in S$ by
more than $e(y, y_\theta(x))$. Since $y_\theta(x)$ is independent of the training label $y$, the
structured SVM program takes the form:

$$\text{(structured SVM)} \qquad \min_\theta \sum_{(x,y) \in S} \max_{\hat y \in \mathcal{Y}} \left\{ e_y(\hat y) + \theta^\top \Phi(x, \hat y) \right\} - d^\top \theta + \frac{C}{p} \|\theta\|_p^p,$$

where $(x, y) \in S$ ranges over the training pairs, and $d$ is the vector of empirical means.
The structured SVM objective involves the max function, hence it is not smooth. However,
every convex function $f(\theta)$ has a subdifferential, denoted by $\partial f(\theta)$, which is
the family of subgradients, i.e. supporting hyperplanes to its epigraph at $(\theta, f(\theta))$.
The subdifferential generalizes the concept of the gradient since a convex function is smooth if
and only if its subdifferential consists of a single vector, i.e. its gradient (cf. Bertsekas et al.
(2003), Theorem 4.2.2). Danskin's theorem (cf. Bertsekas et al. (2003), Theorem 4.5.1) states
that the subdifferential of the max function corresponds to the probability distributions
$p(y^* \mid \theta; x, y)$ over the optimal set
$\mathcal{Y}^* = \arg\max_{\hat y \in \mathcal{Y}} \{ e_y(\hat y) + \theta^\top \Phi(x, \hat y) \}$,
therefore the subdifferential of the structured SVM takes the form

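$$\sum_{(x,y) \in S} \sum_{y^* \in \mathcal{Y}^*} p(y^* \mid \theta; x, y) \, \phi_r(x, y^*) - d_r + C |\theta_r|^{p-1} \mathrm{sign}(\theta_r).$$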

Unlike the smooth case, a negative subgradient does not necessarily point in a direction
of descent, therefore subgradient methods are usually not monotonically decreasing, and
depend on the step size. Their solution is taken as the best iterate of the algorithm sequence.

In the following we deal with both structured prediction tasks (i.e., structured SVMs
and CRFs) as two instances of the same framework, by extending the partition function to
norms, namely $Z_\epsilon(x, y) = \| \exp( e_y(\hat y) + \theta^\top \Phi(x, \hat y) ) \|_{1/\epsilon}$,
where the norm is computed for the vector ranging over $\hat y \in \mathcal{Y}$. Using the norm
formulation we move from the partition function, for $\epsilon = 1$, to the maximum over the
exponential function for $\epsilon = 0$. Equivalently, we relate the log-partition and the
max-function by the soft-max function

$$\ln Z_\epsilon(x, y) = \epsilon \ln \sum_{\hat y \in \mathcal{Y}} \exp\left( \frac{e_y(\hat y) + \theta^\top \Phi(x, \hat y)}{\epsilon} \right). \qquad (2)$$

For $\epsilon = 1$ the soft-max function reduces to the log-partition function, and for
$\epsilon = 0$ it reduces to the max-function. Moreover, for $\epsilon \to 0$ the soft-max function
is a smooth approximation of the max-function, in the same way the $\ell_{1/\epsilon}$-norm is a
smooth approximation of the $\ell_\infty$-norm. This smooth approximation of the max-function is
used in different areas of research, e.g. Vontobel and Koetter (2006); Johnson et al. (2007).
We thus define the structured prediction problem as

$$\text{(structured-prediction)} \qquad \min_\theta \sum_{(x,y) \in S} \ln Z_\epsilon(x, y) - d^\top \theta + \frac{C}{p} \|\theta\|_p^p, \qquad (3)$$

which is a one-parameter extension of CRFs and structured SVMs, i.e., $\epsilon = 1$ and
$\epsilon = 0$ respectively. Similarly to CRFs and structured SVMs (Lebanon and Lafferty (2002);
Ratliff et al. (2006)), one can use gradient methods to optimize structured prediction. The
gradient with respect to $\theta_r$ takes the form

$$\sum_{(x,y) \in S} \sum_{\hat y \in \mathcal{Y}} p_\epsilon(\hat y \mid \theta; x, y) \, \phi_r(x, \hat y) - d_r + C |\theta_r|^{p-1} \mathrm{sign}(\theta_r), \qquad (4)$$

where

$$p_\epsilon(\hat y \mid \theta; x, y) = \frac{1}{Z_\epsilon(x, y)^{1/\epsilon}} \exp\left( \frac{e_y(\hat y) + \theta^\top \Phi(x, \hat y)}{\epsilon} \right) \qquad (5)$$

is a probability distribution over the possible labels $\hat y \in \mathcal{Y}$. When
$\epsilon \to 0$ this probability distribution gets concentrated around its maximal values, since
all its elements, normalized by $Z_\epsilon(x, y)$, are raised to the power of a very large number
(i.e. $1/\epsilon$). Therefore for $\epsilon = 0$ we get a structured SVM subgradient.
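The behaviour of the soft-max in (2) and of the distribution in (5) can be checked on a short list of scores. In this minimal sketch the names `soft_max` and `p_eps` are illustrative, each entry of `scores` stands for one value of $e_y(\hat y) + \theta^\top \Phi(x, \hat y)$, and $\epsilon$ is assumed strictly positive.

```python
import math


def soft_max(scores, eps):
    """eps * ln sum_y exp(score_y / eps), shifted by the maximum for stability."""
    m = max(scores)
    return m + eps * math.log(sum(math.exp((s - m) / eps) for s in scores))


def p_eps(scores, eps):
    """Distribution proportional to exp(score_y / eps), as in (5)."""
    m = max(scores)
    weights = [math.exp((s - m) / eps) for s in scores]
    z = sum(weights)
    return [w / z for w in weights]
```

With `eps = 1.0` the function `soft_max` is the usual log-sum-exp; as `eps` shrinks it approaches `max(scores)` from above while `p_eps` concentrates on the maximizing entry. A finite-difference derivative of `soft_max` with respect to one score matches the corresponding entry of `p_eps`, which is the fact used in the gradient (4).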

One can compute the soft-max (2) and the probability $p_\epsilon(\hat y \mid \theta; x, y)$ in (5)
using variational methods (cf. Wainwright and Jordan (2008), Theorem 8.1) since

$$\ln Z_\epsilon(x, y) = \max_{p \in \Delta_{\mathcal{Y}}} \left\{ \sum_{\hat y \in \mathcal{Y}} p(\hat y) \left( e_y(\hat y) + \theta^\top \Phi(x, \hat y) \right) + \epsilon H(p) \right\}, \qquad (6)$$

where $\Delta_{\mathcal{Y}}$ is the set of all probability distributions over $\mathcal{Y}$,
namely $p(\hat y) \ge 0$, $\sum_{\hat y} p(\hat y) = 1$, and $H(p)$ is the entropy of the
probability distribution $p(\hat y)$. One can verify that the distribution
$p_\epsilon(\hat y \mid \theta; x, y)$ in (5) is the optimal argument of the program in (6), by
differentiating and finding the vanishing point of the gradient.
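The variational characterization in (6) can be verified numerically on a small score vector. The sketch below uses illustrative names and a brute-force representation of the distribution; it assumes $\epsilon > 0$ and is only a sanity check, not an inference procedure.

```python
import math


def soft_max(scores, eps):
    """eps * ln sum_y exp(score_y / eps)."""
    m = max(scores)
    return m + eps * math.log(sum(math.exp((s - m) / eps) for s in scores))


def p_eps(scores, eps):
    """Distribution proportional to exp(score_y / eps)."""
    m = max(scores)
    weights = [math.exp((s - m) / eps) for s in scores]
    z = sum(weights)
    return [w / z for w in weights]


def entropy(p):
    """Shannon entropy with the convention 0 * ln 0 = 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def variational_objective(p, scores, eps):
    """Expected score under p plus eps times the entropy of p."""
    return sum(pi * s for pi, s in zip(p, scores)) + eps * entropy(p)
```

Evaluating `variational_objective` at `p_eps(scores, eps)` reproduces `soft_max(scores, eps)`, while any other distribution on the simplex gives a value that is no larger.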



2.2 Structured Prediction in Graphical Models 

In many real-life applications the labels $y \in \mathcal{Y}$ are $n$-tuples,
$y = (y_1, \ldots, y_n)$ for $y_v \in \mathcal{Y}_v$, hence there are exponentially many labels in
$\mathcal{Y}$. The feature maps usually describe relations between subsets of label elements
$y_\alpha \subset \{y_1, \ldots, y_n\}$, and local evidence on single label elements $y_v$, namely

$$\phi_r(x, \hat y_1, \ldots, \hat y_n) = \sum_{v \in V_{r,x}} \phi_{r,v}(x, \hat y_v) + \sum_{\alpha \in E_{r,x}} \phi_{r,\alpha}(x, \hat y_\alpha). \qquad (7)$$

Each feature $\phi_r(x, \hat y)$ can be described by its factor graph $G_{r,x}$, a bipartite graph
with one set of nodes corresponding to $V_{r,x}$ and the other set corresponding to $E_{r,x}$. An
edge connects a single label node $v \in V_{r,x}$ with a subset of label nodes
$\alpha \in E_{r,x}$ if and only if $y_v \in y_\alpha$. In the following we consider the factor
graph $G = \cup G_{r,x}$ which is the union of all factor graphs. We denote by $N(v)$ and
$N(\alpha)$ the set of neighbors of $v$ and $\alpha$ respectively, in the factor graph $G$. For
clarity in the presentation we consider fully factorized priors

$$e_y(\hat y_1, \ldots, \hat y_n) = \sum_{v=1}^{n} e_{y,v}(\hat y_v),$$

although our derivation naturally extends to any graphical model representing the
interactions $e_y(\hat y)$.
The local structure on the features (7) in turn induces a local structure on the learning
problem (3). In particular, the gradient (4) takes the local structure

$$\sum_{(x,y) \in S} \left( \sum_{v \in V_{r,x}, \hat y_v} p_\epsilon(\hat y_v \mid \theta; x, y) \, \phi_{r,v}(x, \hat y_v) + \sum_{\alpha \in E_{r,x}, \hat y_\alpha} p_\epsilon(\hat y_\alpha \mid \theta; x, y) \, \phi_{r,\alpha}(x, \hat y_\alpha) \right) - d_r + C |\theta_r|^{p-1} \mathrm{sign}(\theta_r),$$

which involves the averages of the local features $\phi_{r,v}(x, \hat y_v)$,
$\phi_{r,\alpha}(x, \hat y_\alpha)$ with respect to the marginal probabilities
$p_\epsilon(\hat y_v \mid \theta; x, y)$ and $p_\epsilon(\hat y_\alpha \mid \theta; x, y)$. Moreover,
using the variational method one can observe a local structure on the linear terms of the
soft-max function

$$\ln Z_\epsilon(x, y) = \max_{p \in \Delta_{\mathcal{Y}}} \left\{ \sum_{v, \hat y_v} p(\hat y_v) \phi_v(x, \hat y_v) + \sum_{\alpha, \hat y_\alpha} p(\hat y_\alpha) \phi_\alpha(x, \hat y_\alpha) + \epsilon H(p) \right\},$$

where $p(\hat y_v)$, $p(\hat y_\alpha)$ are the marginal probabilities of $p(\hat y)$. Also, we used
a short notation for the $\theta$-weighted features in the graphical model,
$\phi_v(x, \hat y_v) = e_{y,v}(\hat y_v) + \sum_{r : v \in V_{r,x}} \theta_r \phi_{r,v}(x, \hat y_v)$
and $\phi_\alpha(x, \hat y_\alpha) = \sum_{r : \alpha \in E_{r,x}} \theta_r \phi_{r,\alpha}(x, \hat y_\alpha)$.
We note that in general the soft-max function does not admit a fully local structure due to the
entropy function.

When the factor graph has no cycles, the probability distribution can be represented by a
product of its marginal probabilities,
$p(\hat y) = \prod_\alpha p(\hat y_\alpha) \prod_v p(\hat y_v)^{1 - |N(v)|}$. Therefore the entropy
$H(p)$ can be represented by a sum of local entropies over the marginal probabilities,
$\sum_\alpha H(p(\hat y_\alpha)) + \sum_v (1 - |N(v)|) H(p(\hat y_v))$, which is known as the Bethe
entropy. In this case, using the variational method (6), the soft-max function
$\ln Z_\epsilon(x, y)$ admits the fully local structure

$$\max \left\{ \sum_{v, \hat y_v} p(\hat y_v) \phi_v(x, \hat y_v) + \sum_{\alpha, \hat y_\alpha} p(\hat y_\alpha) \phi_\alpha(x, \hat y_\alpha) + \epsilon \left( \sum_\alpha H(p(\hat y_\alpha)) + \sum_v (1 - |N(v)|) H(p(\hat y_v)) \right) \right\}$$

and the marginal probabilities $p_\epsilon(\hat y_v \mid \theta; x, y)$,
$p_\epsilon(\hat y_\alpha \mid \theta; x, y)$ are the optimal arguments of the local variational
program (Yedidia et al. (2005)). This describes a way to compute the soft-max function in the
structured prediction objective (3) and the marginal probabilities for the gradient (4) without
considering exponentially many elements $\hat y \in \mathcal{Y}$ for graphs without cycles. This
makes it possible to explicitly determine a step size which ensures that the change of $\theta_r$
in the negative gradient direction reduces the structured prediction objective.
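The factorization over marginals and the resulting Bethe entropy can be checked by enumeration on the smallest interesting tree. The sketch below assumes a chain of three binary variables with one pairwise factor per edge; the potentials and function names are hypothetical and chosen only for illustration.

```python
import itertools
import math


def chain_joint(node_pot, edge_pot):
    """Joint distribution of a 3-node binary chain 1-2-3, proportional to exp(score)."""
    weights = {}
    for y in itertools.product([0, 1], repeat=3):
        s = sum(node_pot[v][y[v]] for v in range(3))
        s += edge_pot[0][y[0]][y[1]] + edge_pot[1][y[1]][y[2]]
        weights[y] = math.exp(s)
    z = sum(weights.values())
    return {y: w / z for y, w in weights.items()}


def marginal(p, idx):
    """Marginal of the joint p over the coordinates listed in idx."""
    out = {}
    for y, py in p.items():
        key = tuple(y[i] for i in idx)
        out[key] = out.get(key, 0.0) + py
    return out


def entropy(q):
    """Shannon entropy of a distribution stored as a dict."""
    return -sum(v * math.log(v) for v in q.values() if v > 0.0)
```

On this chain the end nodes have a single neighboring factor, so their exponent $1 - |N(v)|$ is zero, while the middle node has two and enters with exponent $-1$. The joint therefore equals the product of the two pairwise marginals divided by the middle-node marginal, and the entropy equals the two pairwise entropies minus the middle-node entropy.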

In general, when the graphical model has cycles, the structured prediction objective is
exponentially hard to compute since the soft-max, $\ln Z_\epsilon(x, y)$, considers exponentially
many elements $\hat y \in \mathcal{Y}$. Similarly, the gradient involves summing over
exponentially many elements since it requires the marginal probabilities
$p_\epsilon(\hat y_v \mid \theta; x, y)$ and $p_\epsilon(\hat y_\alpha \mid \theta; x, y)$. In the
following we describe the approximate inference framework which is typically used to approximate
the soft-max and the marginal probabilities. The approximate inference framework is based
on the variational method in (6), which derives the soft-max from the marginal probabilities
when the factor graph has no cycles. The main idea is to replace the marginal probabilities
$p(\hat y_v)$, $p(\hat y_\alpha)$ with beliefs $b_v(\hat y_v)$, $b_\alpha(\hat y_\alpha)$ and the
entropy term by a sum of local entropies. The approximation of the variational program takes the
form:

$$\ln Z_\epsilon(x, y) \approx \max_{b_v, b_\alpha} \left\{ \sum_{v, \hat y_v} b_v(\hat y_v) \phi_v(x, \hat y_v) + \sum_{\alpha, \hat y_\alpha} b_\alpha(\hat y_\alpha) \phi_\alpha(x, \hat y_\alpha) + \epsilon \left( \sum_\alpha c_\alpha H(b_\alpha) + \sum_v c_v H(b_v) \right) \right\}$$

$$\text{subject to} \quad b_v(\hat y_v) \in \Delta_{\mathcal{Y}_v}, \quad b_\alpha(\hat y_\alpha) \in \Delta_{\mathcal{Y}_\alpha}, \quad \sum_{\hat y_\alpha \setminus \hat y_v} b_\alpha(\hat y_\alpha) = b_v(\hat y_v), \qquad (8)$$

where $\Delta_{\mathcal{Y}_v}$ is the set of probability distributions over $\mathcal{Y}_v$, i.e.
$b_v(\hat y_v) \ge 0$, $\sum_{\hat y_v} b_v(\hat y_v) = 1$, and $\Delta_{\mathcal{Y}_\alpha}$ is the
set of probability distributions over $\mathcal{Y}_\alpha$.

The local entropy approximation $\sum_\alpha c_\alpha H(b_\alpha) + \sum_v c_v H(b_v)$ in (8) is
known as the fractional entropy approximation of Wiegerinck and Heskes (2003). When the
graphical model has no cycles, and the fractional entropy weights are chosen according to the
Bethe entropy, i.e. $c_\alpha = 1$, $c_v = 1 - |N(v)|$, then the variational program in (8) is
equivalent to the one in (6); it gives an exact characterization of the soft-max, and its optimal
beliefs are the true marginal probabilities. However, when the graphical model has cycles this
variational program is an approximation of the soft-max and the marginal probabilities,
and no guarantees on the quality of the approximation are known so far.

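To compute the soft-max and the marginal probabilities, $p_\epsilon(\hat y_v \mid \theta; x, y)$
and $p_\epsilon(\hat y_\alpha \mid \theta; x, y)$, exponentially many labels have to be considered.
This is in general computationally prohibitive, but when the factor graph has no cycles this can
be done efficiently by the belief propagation algorithms which send messages along the edges of
the factor graph (Pearl (1988)). However, in the presence of cycles inference can only be
approximated, as in (8). In the following we present a message-passing algorithm for solving the
[...] (2005a,b); Heskes (2006); Meltzer [...] [...]tion algorithms we set
$\psi_v(\hat y_v) = \exp(\phi_v(x, \hat y_v))$, and
$\psi_\alpha(\hat y_\alpha) = \exp(\phi_\alpha(x, \hat y_\alpha))$. In this case the messages can be
computed as

$$m_{\alpha \to v}(\hat y_v) = \left\| \psi_\alpha(\hat y_\alpha) \prod_{u \in N(\alpha) \setminus v} n_{u \to \alpha}(\hat y_u) \right\|_{1/\epsilon c_\alpha}$$

$$n_{v \to \alpha}(\hat y_v) \propto \left( \psi_v(\hat y_v) \prod_{\beta \in N(v)} m_{\beta \to v}(\hat y_v) \right)^{c_\alpha / \hat c_v} \Big/ \; m_{\alpha \to v}(\hat y_v)$$

where the norm is computed for the vector ranging over $\hat y_\alpha$ while holding $\hat y_v$
fixed, $\hat c_v = c_v + \sum_{\alpha \in N(v)} c_\alpha$, and $\propto$ indicates that the vector
can be normalized. After convergence, one can infer the beliefs by

$$b_v(\hat y_v) \propto \left( \psi_v(\hat y_v) \prod_{\alpha \in N(v)} m_{\alpha \to v}(\hat y_v) \right)^{1/\epsilon \hat c_v}, \qquad b_\alpha(\hat y_\alpha) \propto \left( \psi_\alpha(\hat y_\alpha) \prod_{v \in N(\alpha)} n_{v \to \alpha}(\hat y_v) \right)^{1/\epsilon c_\alpha}.$$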

When the factor graph has no cycles, one can use the above message-passing algorithm
with the Bethe entropy, i.e. $c_\alpha = 1$ and $c_v = 1 - |N(v)|$, to compute in linear time the
soft-max, $\ln Z_\epsilon(x, y)$, and the marginal probabilities,
$p_\epsilon(\hat y_v \mid \theta; x, y)$ and $p_\epsilon(\hat y_\alpha \mid \theta; x, y)$; therefore
it can be used as a subroutine to compute the structured prediction objective (3) and gradient (4).
Focusing on the Bethe entropy, this message-passing algorithm is the $\epsilon$-parameter extension
of the belief propagation algorithms, which includes as special cases the sum-product for
$\epsilon = 1$ and the max-product for $\epsilon = 0$. It can be shown that for every positive
$\epsilon$ this belief propagation extension solves the variational program exactly for graphs
without cycles, since by using a change of variables it reduces to the sum-product algorithm for
$\epsilon > 0$.
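For a chain, the $\epsilon$-extension of sum-product reduces to a forward recursion in the log domain. The sketch below is a simplified illustration specialized to chains with pairwise factors, not the general message-passing scheme; the function names and the potential layout (`node_pot[t][a]`, `edge_pot[t][a][b]`) are assumptions, and $\epsilon$ is taken strictly positive. A brute-force enumeration is included for comparison.

```python
import itertools
import math


def eps_lse(values, eps):
    """eps * ln sum exp(value / eps), shifted by the maximum for stability."""
    m = max(values)
    return m + eps * math.log(sum(math.exp((v - m) / eps) for v in values))


def chain_soft_max(node_pot, edge_pot, eps):
    """Soft-max over all labelings of a chain, computed with forward messages.

    The message into position t + 1 is
    mu[b] = eps * ln sum_a exp((node_pot[t][a] + mu_prev[a] + edge_pot[t][a][b]) / eps).
    """
    n = len(node_pot)
    k = len(node_pot[0])
    mu = [0.0] * k
    for t in range(n - 1):
        mu = [
            eps_lse([node_pot[t][a] + mu[a] + edge_pot[t][a][b] for a in range(k)], eps)
            for b in range(k)
        ]
    return eps_lse([node_pot[n - 1][b] + mu[b] for b in range(k)], eps)


def brute_force_soft_max(node_pot, edge_pot, eps):
    """Same quantity by enumerating every labeling (exponential in the chain length)."""
    n = len(node_pot)
    k = len(node_pot[0])
    totals = []
    for y in itertools.product(range(k), repeat=n):
        s = sum(node_pot[t][y[t]] for t in range(n))
        s += sum(edge_pot[t][y[t]][y[t + 1]] for t in range(n - 1))
        totals.append(s)
    return eps_lse(totals, eps)
```

With `eps = 1.0` the recursion is the familiar sum-product forward pass in log space, and for small `eps` it approaches the max-product (Viterbi) value. The recursion costs time linear in the chain length, whereas the enumeration grows exponentially.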

When the factor graph has cycles this message-passing algorithm has in general no
guarantees of convergence (unless $c_\alpha, c_v > 0$), nor on the number of iterations, nor on
the quality of the solution, and it is used as an approximation for the soft-max and the
marginal probabilities. Therefore, there are two main problems when dealing with graphs
with cycles and approximate inference: efficiency and accuracy. For graphs with cycles
there are no guarantees on the number of steps the message-passing algorithm requires until
convergence, therefore it is computationally costly to run it as a subroutine. Also, as these
message-passing algorithms have no guarantees on the quality of their solution, the gradient
and the objective function can only be approximated. Therefore one cannot verify whether the
update of $\theta_r$ in the negative approximated gradient direction decreased or increased the
structured prediction objective. In general, this heuristic results in an algorithm without a
clear stopping criterion.

In contrast, in this work we propose to approximate the structured prediction problem
and to efficiently solve the approximated problem exactly using message-passing. This
allows us to efficiently learn graphical models with a large number of parameters.



3. Approximate Structured Prediction 

The structured prediction objective in (3) and its gradients defined in (4) cannot be
computed efficiently for general graphs since both involve computing the soft-max function and
the marginal probabilities, which take into account exponentially many elements
$\hat y \in \mathcal{Y}$. In the following we suggest an intuitive approximation for structured
prediction, based on its dual formulation.

We believe that a main difficulty in dealing with convex programs is that special care has
to be taken to consider the set of feasible solutions when constructing the dual function. We
find it simpler to describe the primal program using extended real-valued convex functions,
which are functions that can take the value of infinity. Intuitively, by using these functions
we can ignore their domains, simplifying the derivations. The dual programs of extended
real-valued convex functions $g : \mathbb{R}^k \to \bar{\mathbb{R}}$ are formulated in terms of
their conjugate dual

$$g^*(z) = \max_\mu \left\{ \mu^\top z - g(\mu) \right\}.$$

Throughout this work we use the following duality theorem, known as Fenchel duality
(Fenchel (1951); Rockafellar (1970); Bertsekas et al. (2003)).

Theorem 1 Let $\Phi$ be a $k_1 \times k_2$ matrix, and let $p, e \in \mathbb{R}^{k_2}$ and
$\theta, d \in \mathbb{R}^{k_1}$ be vectors. The following are primal-dual programs:

$$\text{(Primal)} \qquad \min_\theta \left\{ f(\Phi^\top \theta + e) - d^\top \theta + h(-\theta) \right\}$$

$$\text{(Dual)} \qquad \max_p \left\{ -f^*(p) + p^\top e - h^*(\Phi p - d) \right\}$$
Proof: We use the Lagrange duality theorem, minimizing the function
$f(\mu + e) - d^\top \theta + h(-\theta)$ subject to the constraints $\mu = \Phi^\top \theta$.
These equality constraints hold for every coordinate indexed by $\{1, \ldots, k_2\}$, therefore
they correspond to Lagrange multipliers $p \in \mathbb{R}^{k_2}$. The Lagrangian takes the form

$$L(\mu, \theta, p) = f(\mu + e) - d^\top \theta + h(-\theta) - p^\top (\mu - \Phi^\top \theta).$$

By minimizing with respect to the primal variables, $\min_{\mu, \theta} L(\mu, \theta, p)$, we get
the dual function above. $\square$

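Since the conjugate dual of the soft-max is the entropy barrier, it follows that the dual
program for structured prediction is governed by the entropy function of the probabilities
$p_{x,y}(\hat y)$. The following duality formulation is known for CRFs when $\epsilon = 1$ with
$\ell_2^2$ regularization, and for structured SVMs when $\epsilon = 0$ with $\ell_2^2$
regularization (Lebanon and Lafferty (2002); Taskar et al. (2004); Collins et al. (2008)). Here we
derive the dual program for every $\epsilon$ and every $\ell_p^p$ regularization using conjugate
duality:

Claim 1 The dual program of the structured prediction program in (3) takes the form

$$\max_{p_{x,y}(\hat y) \in \Delta_{\mathcal{Y}}} \sum_{(x,y) \in S} \left( \epsilon H(p_{x,y}) + p_{x,y}^\top e_y \right) - \frac{C^{1-q}}{q} \left\| \sum_{(x,y) \in S} \sum_{\hat y \in \mathcal{Y}} p_{x,y}(\hat y) \Phi(x, \hat y) - d \right\|_q^q,$$

where $\Delta_{\mathcal{Y}}$ is the probability simplex over $\mathcal{Y}$,
$H(p_{x,y}) = -\sum_{\hat y} p_{x,y}(\hat y) \ln p_{x,y}(\hat y)$ is the entropy function and
$p_{x,y}^\top e_y = \sum_{\hat y} p_{x,y}(\hat y) e_y(\hat y)$.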

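Proof: The proof follows the one of Theorem 1. We first describe an equivalent program
to the one in (3) by adding variables $\mu(x, \hat y)$ instead of $\theta^\top \Phi(x, \hat y)$ to
decouple the soft-max from the regularization:

$$\min_{\theta, \; \mu(x, \hat y) = \theta^\top \Phi(x, \hat y)} \; \sum_{(x,y) \in S} \epsilon \ln \sum_{\hat y} \exp\left( \frac{e_y(\hat y) + \mu(x, \hat y)}{\epsilon} \right) - d^\top \theta + \frac{C}{p} \|\theta\|_p^p.$$

To maintain consistency, we add the constraints $\mu(x, \hat y) = \theta^\top \Phi(x, \hat y)$, for
every $(x, y) \in S$ and every $\hat y \in \mathcal{Y}$. We compute the Lagrangian by adding the
Lagrange multipliers $p_{x,y}(\hat y)$:

$$L(\mu, \theta, p_{x,y}) = \sum_{(x,y) \in S} \epsilon \ln \sum_{\hat y} \exp\left( \frac{e_y(\hat y) + \mu(x, \hat y)}{\epsilon} \right) - d^\top \theta + \frac{C}{p} \|\theta\|_p^p - \sum_{(x,y) \in S, \; \hat y \in \mathcal{Y}} p_{x,y}(\hat y) \left( \mu(x, \hat y) - \theta^\top \Phi(x, \hat y) \right).$$

The dual function is a function of the Lagrange multipliers, and it is derived by minimizing
the Lagrangian, namely $q(p_{x,y}) = \min_{\mu, \theta} L(\mu, \theta, p_{x,y})$. The dual function
can be written as

$$\sum_{(x,y) \in S} \min_\mu \left\{ \epsilon \ln \sum_{\hat y} \exp\left( \frac{e_y(\hat y) + \mu(x, \hat y)}{\epsilon} \right) - \sum_{\hat y} p_{x,y}(\hat y) \mu(x, \hat y) \right\} + \min_\theta \left\{ \frac{C}{p} \|\theta\|_p^p - \theta^\top \Big( d - \sum_{(x,y) \in S, \; \hat y} p_{x,y}(\hat y) \Phi(x, \hat y) \Big) \right\},$$

hence it is composed of the conjugate dual of the soft-max and the conjugate dual of the
$\ell_p^p$ norm. Recall that the conjugate dual of the soft-max is the entropy barrier
$\epsilon H(p_{x,y})$ over the set of probability distributions $\Delta_{\mathcal{Y}}$ (cf.
Wainwright and Jordan (2008), Theorem 8.1). Also, the linear shift $e_y(\hat y)$ of the soft-max
argument results in a linear shift of the conjugate dual, thus we get the first part of the dual
function, $\sum_{(x,y) \in S} ( \epsilon H(p_{x,y}) + e_y^\top p_{x,y} )$. Similarly, the conjugate
dual of $\frac{C}{p} \|\theta\|_p^p$ is $\frac{C^{1-q}}{q} \|z\|_q^q$ for the dual norm
$1/p + 1/q = 1$ (cf. Rockafellar (1970), page 106), where in our case
$z = \sum_{(x,y) \in S, \; \hat y} p_{x,y}(\hat y) \Phi(x, \hat y) - d$. $\square$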

When $\epsilon = 1$ the CRF dual program reduces to the well-known duality relation between
the log-likelihood and the entropy. When $\epsilon = 0$ we obtain the dual formulation of
structured SVMs, which emphasizes the duality relation between the max-function and the
probability simplex. In general, Claim 1 describes the relation between the soft-max function and
the entropy barrier over the probability simplex.

The dual formulation in Claim 1 gives more information on the structured prediction
program in (3); in particular, it demonstrates different connections between structured
SVMs and CRFs. Both models try to fit a probability distribution $p_{x,y}$ to a prior $e_y$,
while matching the empirical means to be as close as possible to the learned model means,
$d \approx \sum_{(x,y) \in S} \sum_{\hat y \in \mathcal{Y}} p_{x,y}(\hat y) \Phi(x, \hat y)$. However,
in CRFs the $p_{x,y}$ are chosen with respect to a KL-divergence from the prior $\exp(-e_y)$,
whereas in structured SVMs they are chosen with respect to the inner product $p_{x,y}^\top e_y$.

Intuitively, this one-parameter extension implies that we can approximate structured
SVMs by solving CRFs with prior $\exp(-e_y / \epsilon)$, and weighting the regularization by
$C^{1-q} / (\epsilon q)$, while taking $\epsilon \to 0$. This is equivalent to solving the dual
program

$$\epsilon \cdot \max_{p_{x,y} \in \Delta_{\mathcal{Y}}} \left\{ \sum_{(x,y) \in S} \left( H(p_{x,y}) + \frac{p_{x,y}^\top e_y}{\epsilon} \right) - \frac{C^{1-q}}{\epsilon q} \left\| \sum_{(x,y) \in S} \sum_{\hat y \in \mathcal{Y}} p_{x,y}(\hat y) \Phi(x, \hat y) - d \right\|_q^q \right\}.$$

For example, considering the zero-one loss, the prior suggests the dual optimal solution is
a distribution which is concentrated around the training label, while weighting the
regularization differently. Although this approach is algorithmically unstable for
$\epsilon \to 0$ compared to the original formulation, it may give useful intuition on how to
relate both approaches by considering different weights on the priors and regularizations.

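The dual program in Claim 1 considers the probabilities $p_{x,y}(\hat y)$ over exponentially
many labels $\hat y \in \mathcal{Y}$, as well as their entropies $H(p_{x,y})$. However, when we take
into account the graphical model $G_{r,x}$ imposed by the features we observe that the linear
terms in the dual formulation consider the marginal probabilities $p_{x,y}(\hat y_v)$ and
$p_{x,y}(\hat y_\alpha)$. We thus propose to replace the marginal probabilities with their
corresponding beliefs $b_{x,y,v}(\hat y_v)$, $b_{x,y,\alpha}(\hat y_\alpha)$, and to replace the
entropy term by the sum of local entropies
$\sum_\alpha c_\alpha H(b_{x,y,\alpha}) + \sum_v c_v H(b_{x,y,v})$.
This results in the following approximation of the structured prediction problem:

(approximated structured prediction - dual)

$$\max_{b_{x,y,v}, \, b_{x,y,\alpha}} \sum_{(x,y) \in S} \left( \sum_\alpha \epsilon c_\alpha H(b_{x,y,\alpha}) + \sum_v \epsilon c_v H(b_{x,y,v}) + \sum_{v, \hat y_v} b_{x,y,v}(\hat y_v) e_{y,v}(\hat y_v) \right) - \frac{C^{1-q}}{q} \sum_r \left| \sum_{(x,y) \in S, \, v \in V_{r,x}, \, \hat y_v} b_{x,y,v}(\hat y_v) \phi_{r,v}(x, \hat y_v) + \sum_{(x,y) \in S, \, \alpha \in E_{r,x}, \, \hat y_\alpha} b_{x,y,\alpha}(\hat y_\alpha) \phi_{r,\alpha}(x, \hat y_\alpha) - d_r \right|^q$$

$$\text{subject to} \quad b_{x,y,v}(\hat y_v) \in \Delta_{\mathcal{Y}_v}, \quad b_{x,y,\alpha}(\hat y_\alpha) \in \Delta_{\mathcal{Y}_\alpha}, \quad \sum_{\hat y_\alpha \setminus \hat y_v} b_{x,y,\alpha}(\hat y_\alpha) = b_{x,y,v}(\hat y_v).$$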

Whenever $\epsilon, c_v, c_\alpha \ge 0$, the approximated dual above is concave and its dual is a
convex primal program. By deriving the dual of the approximated dual we obtain our approximated
structured prediction, for which we construct an efficient algorithm in Section 4.



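Theorem 2 The approximation of the structured prediction program in (3) takes the form

$$\min_{\lambda_{x,y,v \to \alpha}, \, \theta} \; \sum_{(x,y) \in S, \, v} \epsilon c_v \ln \sum_{\hat y_v} \exp\left( \frac{e_{y,v}(\hat y_v) + \sum_{r : v \in V_{r,x}} \theta_r \phi_{r,v}(x, \hat y_v) - \sum_{\alpha \in N(v)} \lambda_{x,y,v \to \alpha}(\hat y_v)}{\epsilon c_v} \right)$$

$$+ \sum_{(x,y) \in S, \, \alpha} \epsilon c_\alpha \ln \sum_{\hat y_\alpha} \exp\left( \frac{\sum_{r : \alpha \in E_{r,x}} \theta_r \phi_{r,\alpha}(x, \hat y_\alpha) + \sum_{v \in N(\alpha)} \lambda_{x,y,v \to \alpha}(\hat y_v)}{\epsilon c_\alpha} \right) - d^\top \theta + \frac{C}{p} \|\theta\|_p^p.$$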

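Proof: The proof follows the one of Theorem 1. We first describe an equivalent program to
the approximated dual by adding variables $z_r$ to decouple the entropies from the moment
matching constraints:

$$\max_{b, \, z} \; \sum_{(x,y) \in S} \left( \sum_{\alpha \in E} \epsilon c_\alpha H(b_{x,y,\alpha}) + \sum_{v \in V} \epsilon c_v H(b_{x,y,v}) + \sum_{v \in V, \, \hat y_v} b_{x,y,v}(\hat y_v) e_{y,v}(\hat y_v) \right) - \frac{C^{1-q}}{q} \|z - d\|_q^q$$

subject to the beliefs marginalization constraints, and the consistency constraints

$$z_r = \sum_{(x,y) \in S, \, v \in V_{r,x}, \, \hat y_v} b_{x,y,v}(\hat y_v) \phi_{r,v}(x, \hat y_v) + \sum_{(x,y) \in S, \, \alpha \in E_{r,x}, \, \hat y_\alpha} b_{x,y,\alpha}(\hat y_\alpha) \phi_{r,\alpha}(x, \hat y_\alpha).$$

We derive the Lagrangian by introducing the Lagrange multipliers
$\lambda_{x,y,v \to \alpha}(\hat y_v)$ for every marginalization constraint
$\sum_{\hat y_\alpha \setminus \hat y_v} b_{x,y,\alpha}(\hat y_\alpha) = b_{x,y,v}(\hat y_v)$, and
Lagrange multipliers $\theta_r$ for every equality constraint involving $z_r$. In particular, the
Lagrangian has the form:

$$L = \sum_{(x,y) \in S} \left( \sum_{\alpha \in E} \epsilon c_\alpha H(b_{x,y,\alpha}) + \sum_{v \in V} \epsilon c_v H(b_{x,y,v}) + \sum_{v \in V, \, \hat y_v} b_{x,y,v}(\hat y_v) e_{y,v}(\hat y_v) \right) - \frac{C^{1-q}}{q} \|z - d\|_q^q$$

$$+ \sum_{(x,y) \in S} \sum_{v, \, \alpha \in N(v), \, \hat y_v} \lambda_{x,y,v \to \alpha}(\hat y_v) \left( \sum_{\hat y_\alpha \setminus \hat y_v} b_{x,y,\alpha}(\hat y_\alpha) - b_{x,y,v}(\hat y_v) \right)$$

$$+ \sum_r \theta_r \left( \sum_{(x,y) \in S, \, v \in V_{r,x}, \, \hat y_v} b_{x,y,v}(\hat y_v) \phi_{r,v}(x, \hat y_v) + \sum_{(x,y) \in S, \, \alpha \in E_{r,x}, \, \hat y_\alpha} b_{x,y,\alpha}(\hat y_\alpha) \phi_{r,\alpha}(x, \hat y_\alpha) - z_r \right).$$

We obtain the dual function by maximizing over the beliefs in their compact domain, i.e.

$$q(\lambda_{x,y,v \to \alpha}, \theta) = \max_{b_{x,y,v}(\hat y_v) \in \Delta_{\mathcal{Y}_v}, \; b_{x,y,\alpha}(\hat y_\alpha) \in \Delta_{\mathcal{Y}_\alpha}, \; z} L(b_{x,y,v}, b_{x,y,\alpha}, z, \lambda_{x,y,v \to \alpha}, \theta).$$

Deriving the dual by maximizing over the compact set of beliefs enables us to obtain an
unconstrained dual, which corresponds to the approximated structured prediction program.
The dual function is described by the conjugate dual functions:

$$\sum_{(x,y) \in S, \, v} \max_{b_{x,y,v} \in \Delta_{\mathcal{Y}_v}} \left\{ \epsilon c_v H(b_{x,y,v}) + \sum_{\hat y_v} b_{x,y,v}(\hat y_v) \Big( e_{y,v}(\hat y_v) + \sum_{r : v \in V_{r,x}} \theta_r \phi_{r,v}(x, \hat y_v) - \sum_{\alpha \in N(v)} \lambda_{x,y,v \to \alpha}(\hat y_v) \Big) \right\}$$

$$+ \sum_{(x,y) \in S, \, \alpha} \max_{b_{x,y,\alpha} \in \Delta_{\mathcal{Y}_\alpha}} \left\{ \epsilon c_\alpha H(b_{x,y,\alpha}) + \sum_{\hat y_\alpha} b_{x,y,\alpha}(\hat y_\alpha) \Big( \sum_{r : \alpha \in E_{r,x}} \theta_r \phi_{r,\alpha}(x, \hat y_\alpha) + \sum_{v \in N(\alpha)} \lambda_{x,y,v \to \alpha}(\hat y_v) \Big) \right\}$$

$$+ \max_z \left\{ -\frac{C^{1-q}}{q} \|z - d\|_q^q - z^\top \theta \right\}.$$

Its final form is derived similarly to Claim 1, where we show that the conjugate dual of the
entropy barrier is the soft-max function and the conjugate dual of the $\ell_q^q$ is the
$\ell_p^p$. $\square$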

Comparing the structured prediction in (3) to the approximated structured prediction in
Theorem 2, we conclude that introducing beliefs to approximate the dual is equivalent
to decomposing the soft-max over $\hat y_1, \ldots, \hat y_n$ (which is exponential in $n$) into
the sum of soft-max functions over $\hat y_v$ and $\hat y_\alpha$. This approximation introduces
the messages $\lambda_{x,y,v \to \alpha}(\hat y_v)$, which are the Lagrange multipliers that enforce
the local marginalization constraints over the beliefs.

For the particular case of CRFs (i.e., $\epsilon = 1$) the approximated structured prediction
decomposes the log-partition function into a sum of efficiently computable log-partition
functions, while maintaining consistencies using the messages
$\lambda_{x,y,v \to \alpha}(\hat y_v)$. Similarly, for $\epsilon = 0$, the approximated structured
prediction induces an approximation for structured SVMs, decomposing the max-function into a sum
of local max-functions. The consistency between the separate max-functions is maintained by the
messages $\lambda_{x,y,v \to \alpha}(\hat y_v)$. For $\epsilon \to 0$ the approximated structured
prediction introduces a smooth approximation for the approximated structured SVMs. This is useful
from an algorithmic point of view, where one can use gradient methods, which are in general faster
than subgradient methods.
4. Message-Passing Algorithm for Approximated Structured Prediction

In the following we describe a block coordinate descent algorithm for the approximated 
structured prediction program of Theorem [2j Coordinate descent methods are appealing as 
they optimize a small number of variables while holding the rest fixed, therefore they can 
be performed efficiently and can be easily parallelized. Since the primal program is lower 
bounded by the dual program, the primal objective function is guaranteed to converge. 

We begin by describing how to find the optimal set of variables related to a node $v$ in
the graphical model, namely $\lambda_{x,y,v\to\alpha}(\hat y_v)$ for every $\alpha \in N(v)$, every $\hat y_v$ and every $(x, y) \in \mathcal{S}$.



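Lemma 3 Given a vertex $v$ in the graphical model, the optimal $\lambda_{x,y,v\to\alpha}(\hat y_v)$ for every
$\alpha \in N(v)$, $\hat y_v \in \mathcal{Y}_v$, $(x,y) \in \mathcal{S}$ in the approximated program of Theorem 2 satisfies

$$\mu_{x,y,\alpha\to v}(\hat y_v) = \epsilon c_\alpha \ln \Bigg( \sum_{\hat y_\alpha \setminus \hat y_v} \exp \Bigg( \frac{\sum_{r: \alpha \in E_{r,x}} \theta_r \phi_{r,\alpha}(x, \hat y_\alpha) + \sum_{u \in N(\alpha) \setminus v} \lambda_{x,y,u\to\alpha}(\hat y_u)}{\epsilon c_\alpha} \Bigg) \Bigg)$$

$$\lambda_{x,y,v\to\alpha}(\hat y_v) = \frac{c_\alpha}{\hat c_v} \Bigg( e_y(\hat y_v) + \sum_{r: v \in V_{r,x}} \theta_r \phi_{r,v}(x, \hat y_v) + \sum_{\beta \in N(v)} \mu_{x,y,\beta\to v}(\hat y_v) \Bigg) - \mu_{x,y,\alpha\to v}(\hat y_v) + c_{x,y,v\to\alpha}$$

for every constant $c_{x,y,v\to\alpha}$ (footnote 1), where $\hat c_v = c_v + \sum_{\alpha \in N(v)} c_\alpha$. In particular, if either $\epsilon$ and/or
$c_\alpha$ are zero then $\mu_{x,y,\alpha\to v}$ corresponds to the $\ell_\infty$ norm and can be computed by the max-function.
Moreover, if either $\epsilon$ and/or $c_\alpha$ are zero in the objective, then the optimal $\lambda_{x,y,v\to\alpha}$
can be computed for any arbitrary $c_\alpha > 0$, and similarly for $c_v > 0$.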

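Proof: For a given $(x,y)$ and $v$, optimizing $\lambda_{x,y,v\to\alpha}(\hat y_v)$ for every $\alpha \in N(v)$ and $\hat y_v \in \mathcal{Y}_v$
while holding the rest of the variables fixed reduces the problem to

$$\min_{\lambda_{x,y,v\to\alpha}(\hat y_v)} \; \epsilon c_v \ln \sum_{\hat y_v} \exp \Bigg( \frac{e_y(\hat y_v) + \sum_{r: v \in V_{r,x}} \theta_r \phi_{r,v}(x, \hat y_v) - \sum_{\alpha \in N(v)} \lambda_{x,y,v\to\alpha}(\hat y_v)}{\epsilon c_v} \Bigg) + \sum_{\alpha \in N(v)} \epsilon c_\alpha \ln \sum_{\hat y_\alpha} \exp \Bigg( \frac{\sum_{r: \alpha \in E_{r,x}} \theta_r \phi_{r,\alpha}(x, \hat y_\alpha) + \sum_{u \in N(\alpha)} \lambda_{x,y,u\to\alpha}(\hat y_u)}{\epsilon c_\alpha} \Bigg).$$

Let

$$\mu_{x,y,\alpha\to v}(\hat y_v) = \epsilon c_\alpha \ln \sum_{\hat y_\alpha \setminus \hat y_v} \exp \Bigg( \frac{\sum_{r: \alpha \in E_{r,x}} \theta_r \phi_{r,\alpha}(x, \hat y_\alpha) + \sum_{u \in N(\alpha) \setminus v} \lambda_{x,y,u\to\alpha}(\hat y_u)}{\epsilon c_\alpha} \Bigg)$$

and also $\phi_{x,y,v}(\hat y_v) = e_y(\hat y_v) + \sum_{r: v \in V_{r,x}} \theta_r \phi_{r,v}(x, \hat y_v)$. We find the optimal $\lambda_{x,y,v\to\alpha}(\hat y_v)$
whenever the gradient vanishes, i.e.,

$$0 = \nabla_{\lambda_{x,y,v\to\alpha}(\hat y_v)} \Bigg\{ \epsilon c_\alpha \ln \sum_{\hat y_v} \exp \Bigg( \frac{\mu_{x,y,\alpha\to v}(\hat y_v) + \lambda_{x,y,v\to\alpha}(\hat y_v)}{\epsilon c_\alpha} \Bigg) + \epsilon c_v \ln \sum_{\hat y_v} \exp \Bigg( \frac{\phi_{x,y,v}(\hat y_v) - \sum_{\beta \in N(v)} \lambda_{x,y,v\to\beta}(\hat y_v)}{\epsilon c_v} \Bigg) \Bigg\}.$$

1. For numerical stability in our algorithm we set $c_{x,y,v\to\alpha}$ such that $\sum_{\hat y_v} \lambda_{x,y,v\to\alpha}(\hat y_v) = 0$.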
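Taking the vanishing point of the gradient we derive two probabilities over $\hat y_v$ that need to
be the same, namely

$$\frac{\exp \Big( \frac{\mu_{x,y,\alpha\to v}(\hat y_v) + \lambda_{x,y,v\to\alpha}(\hat y_v)}{\epsilon c_\alpha} \Big)}{\sum_{\hat y'_v} \exp \Big( \frac{\mu_{x,y,\alpha\to v}(\hat y'_v) + \lambda_{x,y,v\to\alpha}(\hat y'_v)}{\epsilon c_\alpha} \Big)} = \frac{\exp \Big( \frac{\phi_{x,y,v}(\hat y_v) - \sum_{\beta \in N(v)} \lambda_{x,y,v\to\beta}(\hat y_v)}{\epsilon c_v} \Big)}{\sum_{\hat y'_v} \exp \Big( \frac{\phi_{x,y,v}(\hat y'_v) - \sum_{\beta \in N(v)} \lambda_{x,y,v\to\beta}(\hat y'_v)}{\epsilon c_v} \Big)}.$$

For simplicity we need to consider only the numerator, while taking one degree of freedom
in the normalization. Taking the log of the numerator we deduce that the gradient vanishes if
the following holds:

$$c_{x,y,v\to\alpha} + \frac{\mu_{x,y,\alpha\to v}(\hat y_v) + \lambda_{x,y,v\to\alpha}(\hat y_v)}{c_\alpha} = \frac{\phi_{x,y,v}(\hat y_v) - \sum_{\beta \in N(v)} \lambda_{x,y,v\to\beta}(\hat y_v)}{c_v} \qquad (10)$$

Here $c_{x,y,v\to\alpha}$ denotes an arbitrary constant, independent of $\hat y_v$, which absorbs the
normalization and the common factor $1/\epsilon$; its value may change from one equation to the next.
Multiplying both sides of the equation by $c_v c_\alpha$, and summing both sides with respect to
$\beta \in N(v)$ gives

$$c_{x,y,v\to\alpha} + c_v \sum_{\beta \in N(v)} \Big( \mu_{x,y,\beta\to v}(\hat y_v) + \lambda_{x,y,v\to\beta}(\hat y_v) \Big) = \Bigg( \sum_{\beta \in N(v)} c_\beta \Bigg) \Bigg( \phi_{x,y,v}(\hat y_v) - \sum_{\beta \in N(v)} \lambda_{x,y,v\to\beta}(\hat y_v) \Bigg) \qquad (11)$$

We wish to find the optimal value of $\lambda_{x,y,v\to\alpha}(\hat y_v)$, namely the value that satisfies Eq. (10).
For that purpose we recover the value of $\sum_{\beta \in N(v)} \lambda_{x,y,v\to\beta}(\hat y_v)$ from (11), using $\hat c_v = c_v + \sum_{\beta \in N(v)} c_\beta$:

$$\sum_{\beta \in N(v)} \lambda_{x,y,v\to\beta}(\hat y_v) = \frac{1}{\hat c_v} \Bigg( \Bigg( \sum_{\beta \in N(v)} c_\beta \Bigg) \phi_{x,y,v}(\hat y_v) - c_v \sum_{\beta \in N(v)} \mu_{x,y,\beta\to v}(\hat y_v) \Bigg) + c_{x,y,v\to\alpha}.$$

Plugging this into (10) gives

$$\mu_{x,y,\alpha\to v}(\hat y_v) + \lambda_{x,y,v\to\alpha}(\hat y_v) = \frac{c_\alpha}{\hat c_v} \Bigg( \phi_{x,y,v}(\hat y_v) + \sum_{\beta \in N(v)} \mu_{x,y,\beta\to v}(\hat y_v) \Bigg) + c_{x,y,v\to\alpha},$$

which concludes the proof for $\epsilon, c_\alpha, c_v > 0$. Whenever any of these quantities is zero,
Danskin's theorem (cf. Bertsekas et al. (2003), Theorem 4.5.1) states that its corresponding
subgradient is described by a probability distribution over its maximal assignments.
Therefore if $c_\alpha = 0$ in the objective function, then equality (10) holds for every $c_\alpha$, and similarly
whenever $c_v = 0$ in the objective, the equality holds for every $c_v$. $\square$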

It is computationally appealing to find the optimal $\lambda_{x,y,v\to\alpha}(\hat y_v)$. When the optimal
value cannot be found, one usually takes a step in the direction of the negative gradient,
and the objective function needs to be computed to ensure that the chosen step size reduces
the objective. Computing the objective function at every iteration significantly
slows the algorithm. Since the optimal $\lambda_{x,y,v\to\alpha}(\hat y_v)$ can be found, the block coordinate
descent algorithm can be executed efficiently in a distributed manner, as every $\lambda_{x,y,v\to\alpha}(\hat y_v)$
is computed independently. The only interactions occur when computing the normalization
step $c_{x,y,v\to\alpha}$. This allows for easy computation on GPUs.
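
As an illustration of the closed-form update in Lemma 3, the sketch below performs it for a single node with two neighboring factors and checks that the resulting node belief agrees with the factor marginals. It assumes the incoming messages $\mu_{x,y,\alpha\to v}$ have already been computed; the function names, the dictionary layout and all numerical values are hypothetical and chosen only for this example.

```python
import math

def node_update(phi_v, mu, c_alpha, c_v):
    """Closed-form update of all messages leaving one node v (Lemma 3).

    phi_v   : list, phi_v[k] = local score of label k (loss plus weighted features)
    mu      : dict factor -> list, incoming message mu_{a->v}(k)
    c_alpha : dict factor -> positive counting number
    c_v     : positive counting number of the node
    Returns a dict factor -> list holding lambda_{v->a}(k).
    """
    c_hat = c_v + sum(c_alpha.values())
    labels = range(len(phi_v))
    total = [phi_v[k] + sum(mu[a][k] for a in mu) for k in labels]
    lam = {}
    for a in mu:
        raw = [c_alpha[a] / c_hat * total[k] - mu[a][k] for k in labels]
        shift = sum(raw) / len(raw)  # free constant, chosen so the messages sum to zero
        lam[a] = [r - shift for r in raw]
    return lam

def normalize(scores, temperature):
    """Distribution proportional to exp(scores / temperature)."""
    m = max(s / temperature for s in scores)
    w = [math.exp(s / temperature - m) for s in scores]
    z = sum(w)
    return [x / z for x in w]

eps, c_v = 1.0, 1.0
phi_v = [0.3, -0.2]                          # hypothetical node scores
mu = {"a": [0.5, 0.1], "b": [-0.4, 0.7]}     # hypothetical incoming messages
c_alpha = {"a": 1.0, "b": 1.0}
lam = node_update(phi_v, mu, c_alpha, c_v)

node_belief = normalize(
    [phi_v[k] - sum(lam[a][k] for a in lam) for k in range(2)], eps * c_v)
factor_marginals = {
    a: normalize([mu[a][k] + lam[a][k] for k in range(2)], eps * c_alpha[a])
    for a in lam}
```

After the update, `node_belief` and every entry of `factor_marginals` coincide, which is the marginalization consistency that the messages are meant to enforce. Each node can be processed this way independently of the others.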
We now turn to describe how to change $\theta$ in order to improve the approximated structured
prediction. Since we cannot find the optimal $\theta_r$ while holding the rest fixed, we
perform a step in the direction of the negative gradient. We choose the step size $\eta$ to
guarantee a descent on the objective.

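Lemma 4 The gradient of the approximated structured prediction program in Theorem 2
with respect to $\theta_r$ equals

$$\sum_{(x,y) \in \mathcal{S},\, v \in V_{r,x},\, \hat y_v} b_{x,y,v}(\hat y_v) \phi_{r,v}(x, \hat y_v) + \sum_{(x,y) \in \mathcal{S},\, \alpha \in E_{r,x},\, \hat y_\alpha} b_{x,y,\alpha}(\hat y_\alpha) \phi_{r,\alpha}(x, \hat y_\alpha) - d_r + C \cdot |\theta_r|^{p-1} \cdot \mathrm{sign}(\theta_r),$$

where

$$b_{x,y,v}(\hat y_v) \propto \exp \Bigg( \frac{e_y(\hat y_v) + \sum_{r: v \in V_{r,x}} \theta_r \phi_{r,v}(x, \hat y_v) - \sum_{\alpha \in N(v)} \lambda_{x,y,v\to\alpha}(\hat y_v)}{\epsilon c_v} \Bigg)$$

$$b_{x,y,\alpha}(\hat y_\alpha) \propto \exp \Bigg( \frac{\sum_{r: \alpha \in E_{r,x}} \theta_r \phi_{r,\alpha}(x, \hat y_\alpha) + \sum_{v \in N(\alpha)} \lambda_{x,y,v\to\alpha}(\hat y_v)}{\epsilon c_\alpha} \Bigg)$$

However, if either $\epsilon$ and/or $c_\alpha$ equal zero, then the beliefs $b_{x,y,\alpha}(\hat y_\alpha)$ can be taken from the
set of probability distributions over the support of the max-beliefs, namely $b_{x,y,\alpha}(\hat y^*_\alpha) > 0$ only
if $\hat y^*_\alpha \in \arg\max_{\hat y_\alpha} \big\{ \sum_{r: \alpha \in E_{r,x}} \theta_r \phi_{r,\alpha}(x, \hat y_\alpha) + \sum_{v \in N(\alpha)} \lambda_{x,y,v\to\alpha}(\hat y_v) \big\}$. Similarly for $b_{x,y,v}(\hat y^*_v)$
whenever $\epsilon$ and/or $c_v$ equal zero.

Proof: This is a direct computation of the gradient. In the special case of $\epsilon, c_\alpha = 0$,
$b_{x,y,\alpha}(\hat y_\alpha)$ corresponds to the subgradient, and similarly when $\epsilon, c_v = 0$ (Danskin's theorem,
Bertsekas et al. (2003), Theorem 4.5.1). $\square$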



The computational complexity of the gradient depends on the structure of the features.
Since the value of the gradient depends on the beliefs for every $v \in V_{r,x}$ and $\alpha \in E_{r,x}$, its
computation takes $|V_{r,x}| + |E_{r,x}|$ operations. Although this is a major improvement over
existing methods, it is clear that our framework prefers many features with small graphical
models rather than few features with large graphical models. Another computational issue
relates to the step size. In general, the coordinate descent scheme verifies that the chosen
step size $\eta$ reduces the objective. Theoretically, for $\epsilon, c_\alpha, c_v > 0$ and $p = 2$ we can use the
fact that the gradient is Lipschitz to predetermine a step size $\eta$ that guarantees descent.
However, in practice this gives worse performance than searching for the step size.
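
The gradient of Lemma 4 for a single parameter $\theta_r$ can be sketched as follows. This is a minimal illustration under stated assumptions: the beliefs are assumed to be given as (belief, feature value) pairs, one pair per training example, region and assignment that feature $r$ touches, and the function name, argument layout and numbers are hypothetical. The step size search described above is omitted, since it requires evaluating the full objective.

```python
def theta_gradient(theta_r, d_r, node_terms, factor_terms, C, p):
    """Gradient of the approximated objective with respect to one parameter theta_r.

    node_terms, factor_terms : lists of (belief, feature_value) pairs
    d_r                      : empirical feature count of feature r
    C, p                     : regularization weight and norm exponent
    """
    expected = (sum(b * f for b, f in node_terms)
                + sum(b * f for b, f in factor_terms))
    sign = (theta_r > 0) - (theta_r < 0)  # -1, 0 or 1
    return expected - d_r + C * abs(theta_r) ** (p - 1) * sign

# Hypothetical beliefs and feature values for one feature r.
node_terms = [(0.7, 1.0), (0.3, -1.0)]
factor_terms = [(0.25, 1.0), (0.75, 0.0)]
g = theta_gradient(0.5, 1.2, node_terms, factor_terms, C=1.0, p=2)

eta = 0.1                      # fixed step, used here only for illustration
theta_new = 0.5 - eta * g
```

The cost of one call is linear in the number of pairs, matching the $|V_{r,x}| + |E_{r,x}|$ operation count per training example.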

Lemmas 3 and 4 describe the coordinate descent algorithm for the approximated structured
prediction in Theorem 2. Figure 1 depicts a summary of the algorithm in the
belief propagation format, setting $n_{x,y,v\to\alpha}(\hat y_v) = \exp(\lambda_{x,y,v\to\alpha}(\hat y_v))$ and
$m_{x,y,\alpha\to v}(\hat y_v) = \exp(\mu_{x,y,\alpha\to v}(\hat y_v))$.

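Message-Passing algorithm for Approximated Structured Prediction:

Set $e_{y,v}(\hat y_v) = \exp(e_{y,v}(\hat y_v))$ and similarly $\phi_{r,v}$, $\phi_{r,\alpha}$.

1. For $t = 1, 2, \ldots$

   (a) For every $v = 1, \ldots, n$, every $(x,y) \in \mathcal{S}$, every $\alpha \in N(v)$, every $\hat y_v \in \mathcal{Y}_v$ do:

   $$m_{x,y,\alpha\to v}(\hat y_v) = \Bigg( \sum_{\hat y_\alpha \setminus \hat y_v} \Bigg( \prod_{r: \alpha \in E_{r,x}} \phi_{r,\alpha}^{\theta_r}(x, \hat y_\alpha) \prod_{u \in N(\alpha) \setminus v} n_{x,y,u\to\alpha}(\hat y_u) \Bigg)^{1/(\epsilon c_\alpha)} \Bigg)^{\epsilon c_\alpha}$$

   $$n_{x,y,v\to\alpha}(\hat y_v) \propto \frac{\Big( e_{y,v}(\hat y_v) \prod_{r: v \in V_{r,x}} \phi_{r,v}^{\theta_r}(x, \hat y_v) \prod_{\beta \in N(v)} m_{x,y,\beta\to v}(\hat y_v) \Big)^{c_\alpha / \hat c_v}}{m_{x,y,\alpha\to v}(\hat y_v)}$$

   (b) For every $r = 1, \ldots, d$ do:

   For every $(x,y) \in \mathcal{S}$, every $v \in V_{r,x}$, $\alpha \in E_{r,x}$, every $\hat y_v \in \mathcal{Y}_v$, $\hat y_\alpha \in \mathcal{Y}_\alpha$ set:

   $$b_{x,y,v}(\hat y_v) \propto \Bigg( \frac{e_{y,v}(\hat y_v) \prod_{r: v \in V_{r,x}} \phi_{r,v}^{\theta_r}(x, \hat y_v)}{\prod_{\alpha \in N(v)} n_{x,y,v\to\alpha}(\hat y_v)} \Bigg)^{1/(\epsilon c_v)}$$

   $$b_{x,y,\alpha}(\hat y_\alpha) \propto \Bigg( \prod_{r: \alpha \in E_{r,x}} \phi_{r,\alpha}^{\theta_r}(x, \hat y_\alpha) \prod_{v \in N(\alpha)} n_{x,y,v\to\alpha}(\hat y_v) \Bigg)^{1/(\epsilon c_\alpha)}$$

   $$\theta_r \leftarrow \theta_r - \eta \Bigg( \sum_{(x,y) \in \mathcal{S},\, v \in V_{r,x},\, \hat y_v} b_{x,y,v}(\hat y_v) \phi_{r,v}(x, \hat y_v) + \sum_{(x,y) \in \mathcal{S},\, \alpha \in E_{r,x},\, \hat y_\alpha} b_{x,y,\alpha}(\hat y_\alpha) \phi_{r,\alpha}(x, \hat y_\alpha) - d_r + C \cdot |\theta_r|^{p-1} \cdot \mathrm{sign}(\theta_r) \Bigg)$$

Figure 1: The block coordinate descent algorithm for approximated structured prediction
in Theorem 2, as described in Lemmas 3 and 4.

The coordinate descent algorithm is guaranteed to converge, as it monotonically decreases
the approximated structured prediction objective in Theorem 2, which is lower
bounded by its dual program. However, convergence to the global minimum cannot be
guaranteed in all cases. In particular, for $\epsilon = 0$ the coordinate descent on the approximated
structured SVMs is not guaranteed to converge to its global minimum, unless one
uses subgradient methods, which are not monotonically decreasing. Moreover, even when
we are guaranteed to converge to the global minimum, i.e., when $\epsilon, c_\alpha, c_v > 0$, the sequence
of variables $\lambda_{x,y,v\to\alpha}(\hat y_v)$ generated by the algorithm is not guaranteed to converge to an
optimal solution, nor to be bounded. As a trivial example, adding an arbitrary constant
to the variables, $\lambda_{x,y,v\to\alpha}(\hat y_v) + c$, does not change the objective value, hence the algorithm
can generate monotonically decreasing unbounded sequences. However, the beliefs generated
by the algorithm are bounded and guaranteed to converge to the unique solution of
the dual approximated structured prediction problem. We now summarize the convergence
properties.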
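
The invariance to constant shifts mentioned above can be verified directly. As a sketch for $\epsilon, c_\alpha, c_v > 0$, fix $(x,y)$, a node $v$ and a neighbor $\alpha$, and add the same constant $c$ to $\lambda_{x,y,v\to\alpha}(\hat y_v)$ for every $\hat y_v$. With $\phi_{x,y,v}$ the local node score from the proof of Lemma 3, and with the shorthand $\psi_{x,y,\alpha}(\hat y_\alpha) = \sum_{r: \alpha \in E_{r,x}} \theta_r \phi_{r,\alpha}(x, \hat y_\alpha)$ (introduced here only for this illustration), the two terms of the objective that involve this message change as follows:

```latex
\begin{align*}
\epsilon c_v \ln \sum_{\hat y_v} \exp\!\Big( \tfrac{\phi_{x,y,v}(\hat y_v) - \sum_{\beta \in N(v)} \lambda_{x,y,v\to\beta}(\hat y_v) - c}{\epsilon c_v} \Big)
  &= \epsilon c_v \ln \sum_{\hat y_v} \exp\!\Big( \tfrac{\phi_{x,y,v}(\hat y_v) - \sum_{\beta \in N(v)} \lambda_{x,y,v\to\beta}(\hat y_v)}{\epsilon c_v} \Big) - c, \\
\epsilon c_\alpha \ln \sum_{\hat y_\alpha} \exp\!\Big( \tfrac{\psi_{x,y,\alpha}(\hat y_\alpha) + \sum_{u \in N(\alpha)} \lambda_{x,y,u\to\alpha}(\hat y_u) + c}{\epsilon c_\alpha} \Big)
  &= \epsilon c_\alpha \ln \sum_{\hat y_\alpha} \exp\!\Big( \tfrac{\psi_{x,y,\alpha}(\hat y_\alpha) + \sum_{u \in N(\alpha)} \lambda_{x,y,u\to\alpha}(\hat y_u)}{\epsilon c_\alpha} \Big) + c.
\end{align*}
```

The shifts $-c$ and $+c$ cancel, so the objective is unchanged while the messages themselves may drift without bound. The beliefs are normalized, so they are unaffected by the shift.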



Claim 2 The block coordinate descent algorithm in Lemmas 3 and 4 monotonically reduces
the approximated structured prediction objective in Theorem 2, therefore the value of its
objective is guaranteed to converge. Moreover, if $\epsilon, c_\alpha, c_v > 0$, the objective is guaranteed to
converge to the global minimum, and its sequence of beliefs is guaranteed to converge to
the unique solution of the approximated structured prediction dual.

Proof: The approximated structured prediction dual is strictly concave in the dual variables
$b_{x,y,v}(\hat y_v)$, $b_{x,y,\alpha}(\hat y_\alpha)$, $z$ subject to linear constraints. The claimed properties are a direct
consequence of Tseng and Bertsekas (1987) for this type of program. $\square$



The convergence result has a practical implication, describing the ways we can estimate
the convergence of the algorithm: by the primal objective, the dual objective or the
beliefs. The approximated structured prediction can also be used for non-concave entropy
approximations, such as the Bethe entropy, where $c_\alpha > 0$ and $c_v < 0$. In this case the
algorithm is well defined, and its stationary points correspond to the stationary points of the
approximated structured prediction and its dual. Intuitively, this statement holds since the
coordinate descent algorithm iterates over points $\lambda_{x,y,v\to\alpha}(\hat y_v), \theta_r$ with vanishing gradients.
Equivalently the algorithm iterates over saddle points $\lambda_{x,y,v\to\alpha}(\hat y_v)$, $b_{x,y,v}(\hat y_v)$, $b_{x,y,\alpha}(\hat y_\alpha)$ and
$\theta_r, z_r$ of the Lagrangian defined in Theorem 2. Whenever the dual program is concave
these saddle points are optimal points of the convex primal, but for a non-concave dual the
algorithm merely iterates over saddle points. This is summarized in the claim below:

Claim 3 Whenever the approximated structured prediction is not convex, i.e., $\epsilon, c_\alpha > 0$
and $c_v < 0$, the algorithm in Lemmas 3 and 4 is not guaranteed to converge, but whenever
it converges it reaches a stationary point of the primal and dual approximated structured
prediction programs.

Proof: The approximated structured prediction in Theorem 2 is unconstrained. The update
rules defined in Lemmas 3 and 4 are directly related to vanishing points of the gradient
of this function, even when it is non-convex. Therefore a stationary point of the algorithm
corresponds to an assignment $\lambda_{x,y,v\to\alpha}(\hat y_v), \theta_r$ for which the gradient equals zero, or
equivalently a stationary point of the approximated structured prediction.

The dual approximated structured prediction is a constrained optimization problem and
its stationary points are saddle points of the Lagrangian, defined in Theorem 2, with respect
to the probability simplex $b_{x,y,v}(\hat y_v) \in \Delta_{\mathcal{Y}_v}$ and $b_{x,y,\alpha}(\hat y_\alpha) \in \Delta_{\mathcal{Y}_\alpha}$. Note that since $\epsilon, c_\alpha, c_v \neq 0$
the entropy functions act as barrier functions on the nonnegative cone, therefore we
need not consider the nonnegativity constraints over the beliefs. In the following we show
that at stationary points the inferred beliefs of the Lagrangian satisfy the marginalization
constraints, and are therefore saddle points of the Lagrangian.

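When $\epsilon, c_\alpha > 0$ the stationary beliefs $b_{x,y,\alpha}(\hat y_\alpha)$ are achieved by maximizing over $\Delta_{\mathcal{Y}_\alpha}$,
resulting in

$$b_{x,y,\alpha}(\hat y_\alpha) \propto \exp \Bigg( \frac{\sum_{r: \alpha \in E_{r,x}} \theta_r \phi_{r,\alpha}(x, \hat y_\alpha) + \sum_{v \in N(\alpha)} \lambda_{x,y,v\to\alpha}(\hat y_v)}{\epsilon c_\alpha} \Bigg).$$

However, since $c_v < 0$ the stationary beliefs $b_{x,y,v}(\hat y_v)$ are achieved by minimizing over $\Delta_{\mathcal{Y}_v}$,
resulting in

$$b_{x,y,v}(\hat y_v) \propto \exp \Bigg( \frac{e_y(\hat y_v) + \sum_{r: v \in V_{r,x}} \theta_r \phi_{r,v}(x, \hat y_v) - \sum_{\alpha \in N(v)} \lambda_{x,y,v\to\alpha}(\hat y_v)}{\epsilon c_v} \Bigg).$$

To prove these beliefs correspond to a stationary point we show that they satisfy the
marginalization constraints. This fact is a direct consequence of the update rule in Lemma
3, where by direct computation one can verify that

$$\sum_{\hat y_\alpha \setminus \hat y_v} b_{x,y,\alpha}(\hat y_\alpha) \propto \exp \Bigg( \frac{\mu_{x,y,\alpha\to v}(\hat y_v) + \lambda_{x,y,v\to\alpha}(\hat y_v)}{\epsilon c_\alpha} \Bigg).$$

Following the definition of $b_{x,y,v}(\hat y_v)$ one can see that the update rule in Lemma 3 enforces the
marginalization constraints. This implies that the gradient of the approximated structured
prediction program measures the disagreements between $\sum_{\hat y_\alpha \setminus \hat y_v} b_{x,y,\alpha}(\hat y_\alpha)$ and $b_{x,y,v}(\hat y_v)$,
and the gradient vanishes only when they agree. Therefore these beliefs correspond to a
saddle point of the Lagrangian. $\square$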

The order of the updates in the algorithm in Figure 1 is not important to guarantee
the convergence properties in Claims 2 and 3. For example, one can perform the updates of
the messages $\lambda_{x,y,v\to\alpha}(\hat y_v)$ until no changes can be made, resulting in beliefs which agree on
their marginal probabilities, and then perform an update step for $\theta_r$. This method is closely
related to the heuristic for solving structured prediction tasks, namely CRFs and structured
SVMs, with an approximate inference engine. This heuristic runs an approximate inference
engine to infer beliefs which agree on their marginal probabilities, and uses them to update
$\theta_r$. However, there are two important differences between these two approaches, in their
accuracy and efficiency. The algorithm in Figure 1 solves the approximated structured
prediction accurately, since it finds a step size $\eta$ for the update of $\theta_r$ that reduces the
approximated structured prediction objective (8). On the other hand, when using the
approximate inference heuristic, one cannot determine a step size $\eta$ to reduce the CRF and
structured SVM objectives, since these objectives cannot be computed accurately for graphs
with cycles. The algorithm in Figure 1 is also more efficient than the structured prediction
heuristic, since it describes a way to update $\theta_r$ even if the inferred beliefs do not agree on
their marginal probabilities, or equivalently $\lambda_{x,y,v\to\alpha}(\hat y_v)$ did not reach a stationary point.
This is based on our theoretical framework in Lemmas 3 and 4, which supports performing a
small number of approximate inference updates of $\lambda_{x,y,v\to\alpha}(\hat y_v)$. These updates re-use the
values of previous iterations to extract intermediate beliefs $b_{x,y,v}(\hat y_v)$, $b_{x,y,\alpha}(\hat y_\alpha)$, which do not
necessarily agree on their marginal probabilities, in order to optimize $\theta_r$. This is in contrast
to running the approximate inference heuristic, which does not have a principled way to
re-use previous computations and whose beliefs are used for optimizing $\theta_r$ only after convergence,
which is computationally intensive as a subroutine.

5. Experimental evaluation 

We performed experiments on 2D grids since they are widely used to represent images and
have many cycles. We first investigate the role of $\epsilon$ in the accuracy and running time of our
algorithm, for fixed $c_\alpha, c_v = 1$. We used a 10 x 10 binary image and randomly generated
10 corrupted samples by flipping every bit with probability 0.2. We trained the model using
$\epsilon = \{1, 0.5, 0.01, 0\}$, ranging from approximated CRFs ($\epsilon = 1$) to the approximated structured
SVM ($\epsilon = 0$) and its smooth version ($\epsilon = 0.01$). The runtimes are 323, 324, 326 and 294 seconds
for $\epsilon = \{1, 0.5, 0.01, 0\}$ respectively. As $\epsilon$ gets smaller the runtime slightly increases, but
it decreases for $\epsilon = 0$ since the $\ell_\infty$ norm is efficiently computed using the max function.
However, $\epsilon = 0$ is less accurate than $\epsilon = 0.01$: when the approximated structured SVM
converges, the gap between the primal and dual objectives was 1.3, and only $10^{-5}$ for $\epsilon > 0$.
This is to be expected since the approximated structured SVM is non-smooth (Claim 2).

             Gaussian noise                         Bimodal noise
             I_1      I_2      I_3      I_4         I_1      I_2      I_3      I_4
LBP-SGD      2.7344   2.4707   3.2275   2.3193      5.2905   4.4751   6.8164   7.2510
LBP-SMD      2.7344   2.4731   3.2324   2.3145      5.2954   4.4678   6.7578   7.2583
LBP-BFGS     2.7417   2.4194   3.1299   2.4023      5.2148   4.3994   6.0278   6.6211
MF-SGD       3.0469   3.0762   4.1382   2.9053     10.0488  41.0718  29.6338  53.6035
MF-SMD       2.9688   3.0640   3.8721  14.4360      -        -        -        -
MF-BFGS      3.0005   2.7783   3.6157   2.4780      5.2661   4.6167   6.4624   7.2510
Ours         0.0488   0.0073   0.1294   0.1318      0.0537   0.0244   0.1221   0.9277

Figure 2: Gaussian and bimodal noise: Comparison of our approach to loopy belief
propagation and mean field approximations when optimizing using BFGS, SGD
and SMD (test error in %). Note that our approach significantly outperforms all the baselines.
MF-SMD did not work for Bimodal noise.

We generated test images in a similar fashion. When using the same $\epsilon$ for training and
testing we obtained 2 misclassifications for $\epsilon > 0$ and 109 for $\epsilon = 0$. We conjecture that
this comes from the non-zero primal-dual gap of $\epsilon = 0$. We also evaluated the quality of the
solution using different values of $\epsilon$ for training and inference, following Wainwright (2006).
When predicting with a smaller $\epsilon$ than the one used for learning, the results are marginally
worse than when predicting with the same $\epsilon$. However, when predicting with a larger $\epsilon$, the
results get significantly worse, e.g., learning with $\epsilon = 0.01$ and predicting with $\epsilon = 1$ results
in 10 errors, compared with only 2 when predicting with $\epsilon = 0.01$.
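
The synthetic data described above can be generated along the following lines. This is a minimal sketch: the clean 10 x 10 image (a centered square), the seed and the function name `corrupt` are assumptions made for illustration, since the actual base image is not specified in the text.

```python
import random

def corrupt(image, flip_prob, rng):
    """Flip each binary pixel independently with probability flip_prob."""
    return [[1 - px if rng.random() < flip_prob else px for px in row]
            for row in image]

rng = random.Random(0)  # fixed seed so the example is reproducible
# Hypothetical clean image: a 4 x 4 square of ones on a 10 x 10 background of zeros.
clean = [[1 if 3 <= r < 7 and 3 <= c < 7 else 0 for c in range(10)]
         for r in range(10)]
samples = [corrupt(clean, 0.2, rng) for _ in range(10)]
```

Each entry of `samples` is one noisy training image paired with the same clean label image.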

The main advantage of our algorithm is that it can efficiently learn many parameters
in a graphical model. We now compare, on a similarly generated dataset of size 5 x 5, a
model learned with different parameters for every edge and vertex (approximately 300 parameters) and a
model learned with parameters shared among the vertices and edges (2 parameters for edges
and 2 for vertices), as used by Kumar and Hebert (2003). Using a large number of parameters
increases performance: sharing parameters resulted in 16 misclassifications, while optimizing
over the 300 parameters resulted in 2 errors. Our algorithm avoids overfitting in this case;
we conjecture this is due to the regularization.

We now compare our approach to state-of-the-art CRF solvers. We use the binary image
dataset of Kumar and Hebert (2003) that consists of 4 different 64 x 64 base images. Each
base image was corrupted 50 times with each type of noise. Following Vishwanathan et al.
(2006), we trained different models to denoise each individual image, using 40 examples for
training and 10 for test. We compare our approach to the result of approximating the conditional
likelihood using loopy belief propagation (LBP) and mean field approximation (MF).
For each of these approximations, we use stochastic gradient descent (SGD), stochastic
meta-descent (SMD) and BFGS to learn the parameters. We do not report pseudolikelihood
(PL) results since it did not work. Note that the same behavior of PL was noticed by
Vishwanathan et al. (2006). To reduce the computational complexity and increase the chances of
convergence, Kumar and Hebert (2003); Vishwanathan et al. (2006) forced their parameters
to be shared across all nodes such that $\forall i, \theta_i = \theta_n$ and $\forall i, \forall j \in N(i), \theta_{ij} = \theta_e$. In contrast,
since our approach is efficient, we can exploit the full flexibility of the graph and learn more
than 10,000 parameters. Note that this is computationally prohibitive with the baselines.
For the local features we simply use the pixel values, and for the node potentials we use an
Ising model with only bias features such that $\phi_\alpha = [1, -1; -1, 1]$. For all experiments we
use $\epsilon = 1$ and $p = 2$. For the baselines, we use the code, features and optimal parameters
of Vishwanathan et al. (2006).

Figure 3: Denoising results: Gaussian (left) and Bimodal (right) noise.



Under the first noise model, each pixel was corrupted via i.i.d. Gaussian noise with
mean 0 and standard deviation of 0.3. Fig. 2 depicts test error in (%) for the different
base images (i.e., $I_1, \ldots, I_4$). Note that our approach considerably outperforms the loopy
belief propagation and mean field approximations for all optimization criteria (BFGS, SGD,
SMD). For example, for the first base image the error of our approach is 0.0488%, which
is equivalent to an error of 2 pixels on average. In contrast, the best baseline gets 112 pixels
wrong on average. Fig. 3 (left) depicts test examples as well as our denoising results. Note
that our approach is able to cope with large amounts of noise.
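
The pixel counts quoted above follow from the image size and the percentages in Fig. 2. A short check, assuming the error is averaged over all 64 x 64 pixels of a test image (variable names are illustrative):

```python
pixels = 64 * 64  # pixels per base image

# Test error in % for base image I_1 under Gaussian noise, taken from Fig. 2.
ours_pct = 0.0488
baseline_pcts = [2.7344, 2.7344, 2.7417, 3.0469, 2.9688, 3.0005]

ours_pixels = ours_pct / 100 * pixels
best_baseline_pixels = min(baseline_pcts) / 100 * pixels
print(round(ours_pixels), round(best_baseline_pixels))
```

Rounded to whole pixels this gives 2 for our approach and 112 for the best baseline.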

Under the second noise model, each pixel was corrupted with an independent mixture
of Gaussians. For each class, a mixture of 2 Gaussians with equal mixing weights was used,
yielding the Bimodal noise. The mixture model parameters were (0.08, 0.03) and (0.46, 0.03)
for the first class and (0.55, 0.02) and (0.42, 0.10) for the second class, where $(a, b)$ denotes a Gaussian
with mean $a$ and standard deviation $b$. Fig. 2 depicts test error in (%) for the different base
images. As before, our approach outperforms all the baselines. We do not report MF-SMD
results since it did not work. Denoised images are shown in Fig. 3 (right). We now show
how our algorithm converges in a few iterations. Fig. 4 depicts the primal and dual training
errors as a function of the number of iterations. Note that our algorithm converges, and
the dual and primal values are very tight after a few iterations.
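
For concreteness, the bimodal noise model could be sampled as sketched below. The mixture parameters are those quoted above. Two things are assumptions made for illustration: label 0 is mapped to the first class and label 1 to the second, and the observed pixel is drawn directly from the class mixture rather than added to the clean value. The function names are hypothetical.

```python
import random

# (mean, standard deviation) of the two equally weighted components per class.
MIXTURES = {
    0: [(0.08, 0.03), (0.46, 0.03)],  # first class
    1: [(0.55, 0.02), (0.42, 0.10)],  # second class
}

def bimodal_pixel(label, rng):
    """Draw one noisy observation for a pixel with the given binary label."""
    mean, std = rng.choice(MIXTURES[label])  # equal mixing weights
    return rng.gauss(mean, std)

def corrupt_bimodal(image, rng):
    """Replace every binary pixel by a sample from its class mixture."""
    return [[bimodal_pixel(px, rng) for px in row] for row in image]

rng = random.Random(0)
noisy = corrupt_bimodal([[0, 1], [1, 0]], rng)
```

Because the two class mixtures overlap (components at 0.46 and 0.42), a pixel cannot be classified reliably from its own value alone, which is what makes the pairwise terms of the model useful here.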



Figure 4: Convergence. Primal and dual training errors when $I_1$ is corrupted with Gaussian
and Bimodal noise. Our algorithm is able to converge in a few iterations.

6. Related work

We now discuss related work. For the special case of CRFs, the idea of approximating
the entropy function with local entropies was used by Wainwright (2006); Ganapathi et al.
(2008). In particular, Wainwright (2006) proved that using a concave entropy approximation
gives robust prediction. Ganapathi et al. (2008) used the non-concave Bethe entropy
approximation $c_\alpha = 1$, $c_v = 1 - |N(v)|$ as well as the concave approximation $c_\alpha = 1$, $c_v = 0$.
Our work differs from these works in two aspects: we derive an efficient algorithm in Section 4
for the concave approximated program ($c_\alpha, c_v > 0$), and our framework and algorithm
include structured SVMs, as well as their smooth approximation when $\epsilon \to 0$.

Some forms of approximated structured prediction were investigated for the special case
of CRFs. Sutton and McCallum (2009) described a similar program, but without the Lagrange
multipliers $\lambda_{x,y,v\to\alpha}(\hat y_v)$ and with no regularization, i.e., $C = 0$. As a result the local
log-partition functions are independent, and an efficient counting algorithm can be used for
learning. Ganapathi et al. (2008) derived an approximated program for $c_\alpha = 1$, $c_v = 0$
without regularization, which was solved by the BFGS convex solver. Also, the constraints
of Ganapathi et al. (2008) were composed differently, which led to a different dual formulation.
Our work is different as it considers efficient algorithms for approximated structured
prediction, and takes advantage of the graphical model by sending messages along its edges.
We show in the experiments that this significantly improves the run-time of the algorithm.
Also, our approximated structured prediction includes as special cases the approximated CRF,
for $\epsilon = 1$, and the approximated structured SVM, for $\epsilon = 0$. Moreover, we describe how
to smoothly approximate the structured SVMs to avoid the shortcomings of subgradient
methods, by simply setting $\epsilon \to 0$.



7. Conclusion and Discussion 

In this paper we have related CRFs and structured SVMs and shown that the soft-max, 
a variant of the log-partition function, approximates smoothly the structured SVM hinge 
loss. We have also proposed an approximation for structured prediction problems based 
on local entropy approximations and derived an efficient message-passing algorithm that is 
guaranteed to converge, even for general graphs. We have demonstrated the effectiveness
of our approach in learning graphical models with a large number of parameters in an image denoising
task. In the future we plan to investigate other domains of application such as image
segmentation. 